from IPython.display import Image
The base data set from which we are starting has approximately 35 features. The data set was cleaned and pre-processed for analysis, as outlined in the prior sections of this report: missing values were identified, outliers dispositioned, and all features re-scaled to a standard normal distribution.
In the nominal data set some features are naturally scaled from 0 to 1 (real values), such as the Latent Dirichlet Allocation (LDA) measures, while others range up to roughly 800,000 (e.g., the number of shares in a social-media context). Since both dimensionality reduction and cluster analyses depend on relative magnitudes, all continuous features were mapped to a standard normal distribution so that each feature carries even weight in the mapping and clustering processes. The binary features (e.g., is_data_channel_technology) are retained as 0/1 indicators and one-hot encoded to similarly support evenly weighted distance evaluations among categorical features.
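A minimal sketch of this re-scaling step, using scikit-learn's `StandardScaler` on a small hypothetical frame (the column names and values here are illustrative stand-ins, not the actual data set):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame mixing a wide-range count feature with a 0-1 LDA
# measure and a binary channel indicator (values are illustrative only).
df = pd.DataFrame({
    "shares": [1200, 800000, 45, 9300],
    "LDA_00": [0.02, 0.91, 0.40, 0.15],
    "is_data_channel_technology": [1, 0, 0, 1],
})

continuous = ["shares", "LDA_00"]

# Map continuous features to zero mean / unit variance so that distance
# computations weight each feature evenly.
scaled = df.copy()
scaled[continuous] = StandardScaler().fit_transform(df[continuous])
# Binary features are left as 0/1 indicators.
```

After this step, `shares` and `LDA_00` contribute on the same scale to any distance-based computation, while the binary indicator is untouched.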
Early efforts to cluster the cleaned data set directly did not yield readily interpretable results. Visually, the cluster maps were not well organized, and the silhouette and distortion metrics were erratic as a function of the number of clusters: these metrics were not smooth curves that indicated, in any clear sense, an optimal or even preferred number of clusters. Methods attempted at that point included k-means, DBSCAN, and spectral clustering.
Thus, we were motivated to explore dimensionality reduction as a means to simplify the data set presented to the clustering algorithms. Evaluating candidates for dimensionality reduction, we considered Principal Components Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), and decided to proceed with t-SNE.
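A sketch of how such a t-SNE mapping and perplexity sweep can be run with scikit-learn. The array here is random stand-in data and the perplexity values are illustrative; the report's actual sweep covered values such as 10, 100, and 200, tracking KL divergence and process time as shown in the figures below.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 35))  # stand-in for the ~35 scaled features

# Map to 2-D at a few perplexity values and record the final
# KL divergence for each run.
results = {}
for perplexity in (10, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    results[perplexity] = (tsne.fit_transform(X), tsne.kl_divergence_)
```

Note that perplexity must be smaller than the number of samples, which is why this toy sweep stops at 50.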
Image("../cluster/t_sne_divergence_process_time.png")
Image("../cluster/t-sne_perplx_plots/perplex_0010.png")
Image("../cluster/t-sne_perplx_plots/perplex_0100.png", retina=True)
Image("../cluster/t-sne_perplx_plots/perplex_0200.png")
Having completed the t-SNE mapping, the next step in the process was to apply different clustering methods to the resulting embedding and evaluate the clustering results.
In our evaluation, we chose to evaluate k-means, DBSCAN, and spectral clustering.
These three methods differ fundamentally in approach, and we assessed that each offers a distinct chance of at least partial success on this data set.
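A minimal sketch of running all three methods on a 2-D embedding with scikit-learn. Toy blob data stands in for the t-SNE output here, and `eps`, `n_clusters`, and the other parameters are illustrative rather than tuned values.

```python
from sklearn.cluster import DBSCAN, KMeans, SpectralClustering
from sklearn.datasets import make_blobs

# Toy 2-D stand-in for the t-SNE embedding.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

labels = {
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X),
    "dbscan": DBSCAN(eps=0.8, min_samples=5).fit_predict(X),
    "spectral": SpectralClustering(n_clusters=4, random_state=0).fit_predict(X),
}
```

The three differ in what they assume: k-means needs k up front and favors convex clusters, DBSCAN discovers cluster count from density (labeling sparse points as noise, -1), and spectral clustering partitions a similarity graph, which can capture non-convex structure.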
#### K-Means Clustering
Thus, by standard measures, appropriate choices for the number of clusters from this k-means clustering analysis are 12 or 13, and it is also reasonable to evaluate the clusters with k = 7.
Image("../cluster/cluster_kmeans_number_of_clusters_eval.png")
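The distortion and silhouette curves used above to pick candidate values of k can be computed as sketched below, here on toy blob data and over an illustrative range of k rather than the report's full sweep.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy stand-in for the t-SNE embedding; the real evaluation ran the
# same loop over the mapped data for a wider range of k.
X, _ = make_blobs(n_samples=400, centers=5, random_state=1)

scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Distortion (inertia) drives the elbow plot; silhouette measures
    # how well-separated the resulting clusters are.
    scores[k] = (km.inertia_, silhouette_score(X, km.labels_))
```

An elbow in the inertia curve combined with a local maximum in silhouette is the usual signal for a preferred k.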
#### K-Means Clusters Evaluation
To evaluate the resulting clusters for each of the above candidate values of k (7, 12, 13), the following approach is taken:
Construct a visual interpretation aid consisting of a three-plot set for each feature, as shown in the figures below. Each three-plot set includes the following:
Image("../cluster/cluster_kmeans_downselect_3way_preplx_100_12_clstrsln_LDA_00.png")
Image("../cluster/cluster_kmeans_downselect_3way_preplx_100_12_clstrsln_LDA_01.png")
Image("../cluster/cluster_kmeans_downselect_3way_preplx_100_12_clstrsln_LDA_02.png")
Image("../cluster/cluster_kmeans_downselect_3way_preplx_100_12_clstrsln_LDA_03.png")
Image("../cluster/cluster_kmeans_downselect_3way_preplx_100_12_clstrsln_LDA_04.png")
Image("../cluster/cluster_kmeans_cluster_barplots.png")
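The cluster bar plots above summarize feature values by cluster; the underlying table can be computed as sketched here, with hypothetical feature names and random stand-in data in place of the actual scaled features and k-means labels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical scaled features plus a cluster assignment column
# (in the report, the assignment comes from the k-means fit).
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["LDA_00", "LDA_01", "shares"])
df["cluster"] = rng.integers(0, 3, size=100)

# Per-cluster feature means: the table behind a grouped bar plot.
profile = df.groupby("cluster").mean()
```

Because the features were standardized, a per-cluster mean far from zero marks a feature on which that cluster deviates from the overall population, which is what makes the bar plots readable.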